2025-05-22-12-06
COSMIC: Enabling Full-Stack Co-Design and Optimization of Distributed Machine Learning Systems
Abstract
arXiv:2505.15020v1. Large-scale machine learning models necessitate distributed systems, which pose significant design challenges because of the large parameter space spanning distinct design stacks. Existing studies often focus on optimizing individual system aspects in isolation. This work challenges this limitation and introduces COSMIC, a full-stack distributed machine learning systems environment that enables end-to-end simulation and agent-based design space exploration. To facilitate efficient exploration and optimization across the entire stack, we introduce the Parameter Set Architecture, an abstraction analogous to the instruction set architecture that abstracts away the configuration complexities of agent-based search methods. Case studies demonstrate COSMIC's ability to consolidate parameters across multiple layers of design abstraction, discovering eight non-obvious high-performance system configurations across four transformer-based models with up to 175 billion parameters. By optimizing across the stack, COSMIC delivers 1.50-48.41x higher performance than isolated single-stack optimization.
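Summary
Large-scale machine learning models require distributed systems, and the vast parameter space across different design stacks makes such systems hard to design. Existing studies tend to optimize a single system aspect in isolation. This work moves past that limitation and presents COSMIC, a full-stack distributed machine learning systems environment that supports end-to-end simulation and agent-based design space exploration. To enable efficient exploration and optimization across the full stack, we propose the Parameter Set Architecture, an abstraction that plays a role similar to an instruction set architecture and removes the configuration complexity of agent-based search methods. Case studies show that COSMIC can consolidate parameters across multiple levels of design abstraction, and that it finds eight non-obvious high-performance system configurations across four Transformer-based models with up to 175 billion parameters. With full-stack optimization, COSMIC achieves 1.50-48.41x higher performance than isolated single-stack optimization.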
Balanced and Elastic End-to-end Training of Dynamic LLMs
Abstract
arXiv:2505.14864v1. To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes such as Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that ensures optimal compute distribution when pipeline parallelism is used to train dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23x (MoEs), 3.18x (pruning), 2.23x (layer freezing), 4.02x (sparse attention), 4.52x (early exit), and 1.17x (MoDs). DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.
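Summary
To cut the compute and memory costs of large language models (LLMs), a range of dynamic workload reduction schemes has been proposed, including Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs). These methods, however, cause severe load imbalance, which limits their usefulness in large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that achieves optimal allocation of compute when dynamic models are trained with pipeline parallelism. DynMo balances workloads adaptively, dynamically packs tasks onto fewer workers to release idle resources, and supports both multi-GPU single-node and multi-node systems. Compared with static training methods (Megatron-LM, DeepSpeed), DynMo speeds up training by up to 1.23x for MoEs, 3.18x for pruning, 2.23x for layer freezing, 4.02x for sparse attention, 4.52x for early exit, and 1.17x for MoDs. DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.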
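To make the imbalance problem in the DynMo entry concrete, here is a minimal Python sketch of one classic way to rebalance pipeline stages: choosing contiguous layer-to-stage cuts that minimize the cost of the slowest stage. This is not DynMo's algorithm, which the abstract does not spell out. The function names `balance_stages` and `stage_costs` and the per-layer costs are hypothetical and chosen only for illustration.

```python
from itertools import accumulate
from typing import List, Sequence


def balance_stages(layer_costs: Sequence[float], num_stages: int) -> List[List[int]]:
    """Split layers into contiguous pipeline stages, minimizing the slowest stage.

    Illustrative dynamic program (linear partition); an assumption, not DynMo's method.
    """
    n = len(layer_costs)
    if not 1 <= num_stages <= n:
        raise ValueError("need 1 <= num_stages <= number of layers")
    prefix = [0.0, *accumulate(layer_costs)]  # prefix[i] = cost of layers [0, i)
    inf = float("inf")
    # best[s][i]: smallest achievable bottleneck when the first i layers use s stages
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0.0
    for s in range(1, num_stages + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):  # the last stage holds layers [j, i)
                bottleneck = max(best[s - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[s][i]:
                    best[s][i], cut[s][i] = bottleneck, j
    # Walk the recorded cut points backwards to recover the stages.
    stages, end = [], n
    for s in range(num_stages, 0, -1):
        start = cut[s][end]
        stages.append(list(range(start, end)))
        end = start
    return stages[::-1]


def stage_costs(layer_costs: Sequence[float], stages: List[List[int]]) -> List[float]:
    """Total cost of each stage under a given layer assignment."""
    return [sum(layer_costs[i] for i in stage) for stage in stages]


# Hypothetical per-layer costs after later layers became cheaper (e.g. pruning).
costs = [6, 2, 2, 2, 1, 1, 1, 1]
even_split = [[0, 1], [2, 3], [4, 5], [6, 7]]
rebalanced = balance_stages(costs, 4)
print(max(stage_costs(costs, even_split)), max(stage_costs(costs, rebalanced)))
```

With these made-up costs, a static even split leaves the first stage as the bottleneck, while the rebalanced cuts lower the cost of the slowest stage. A real system would also have to account for communication cost and memory limits, which this sketch ignores.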
FOL-Pretrain: A complexity annotated corpus of first-order logic
Abstract
arXiv:2505.14932v1. Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities, ranging from coding and solving mathematical problems to commonsense inference. While these tasks vary in complexity, they all require models to integrate and compute over structured information. Despite recent efforts to reverse-engineer LLM behavior through controlled experiments, our understanding of how these models internalize and execute complex algorithms remains limited. Progress has largely been confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One barrier to deeper understanding is the nature of pretraining data, which is vast, heterogeneous, and often poorly annotated, making it difficult to isolate mechanisms of reasoning. To bridge this gap, we introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, designed to probe and analyze algorithmic reasoning in LLMs. The dataset consists of 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Each synthetic example is verifiably correct, produced by a custom automated theorem solver, and accompanied by metadata tracing its algorithmic provenance. We aim to provide a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning processes, paving the way for more transparent and targeted investigations into the algorithmic capabilities of modern models.
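Summary
Transformer-based large language models (LLMs) have shown strong reasoning abilities in areas ranging from programming and mathematical problem solving to commonsense inference. Although these tasks differ in complexity, all of them require the model to integrate and compute over structured information. Despite recent work that reverse-engineers LLM behavior through controlled experiments, our understanding of how these models internally carry out complex algorithms remains limited. Progress so far is largely confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One obstacle to deeper understanding is the nature of pretraining data, which is vast, heterogeneous, and often poorly annotated, making it hard to isolate reasoning mechanisms. To close this gap, we present a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces for probing and analyzing algorithmic reasoning in LLMs. The dataset contains 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Every synthetic example is produced by a custom automated theorem prover, is verifiably correct, and comes with metadata tracing its algorithmic provenance. We hope this scalable, interpretable dataset will serve as a tool for studying how LLMs learn and generalize symbolic reasoning processes, opening a more transparent and targeted path for research into the algorithmic capabilities of modern models.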
Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge
Abstract
arXiv:2505.15240v1. This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve system efficiency, reducing the number of comparisons needed by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
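Summary
This paper examines generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are special cases of a broader framework that supports a variety of modelling choices. We further propose improved uncertainty estimates for individual comparisons, which allow more efficient selection and deliver strong performance with fewer evaluations. We also introduce a method for estimating the uncertainty of the overall ranking. Finally, we show that combining absolute and comparative scoring improves performance. Experiments show that the choice of expert model has limited impact on the final rankings, but the proposed uncertainty estimates, especially the probability of reordering, markedly improve system efficiency and cut the number of comparisons needed by about 50%. In addition, ranking-level uncertainty metrics can be used to identify low-quality predictions, and here the nature of the probabilistic model has a notable effect on the quality of the overall uncertainty.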
When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning
Abstract
arXiv:2505.15276v1. Large reasoning models (LRMs) have significantly advanced performance on complex tasks, yet their tendency to overthink introduces inefficiencies. This study investigates the internal mechanisms of reinforcement learning (RL)-trained LRMs when prompted to save thinking, revealing three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through comprehensive analysis of confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections, we uncover key factors influencing the reasoning behaviors. We further find that NT reduces output length at the cost of accuracy, while ET and IT maintain accuracy with reduced response length. Our findings expose fundamental inconsistencies in RL-optimized LRMs, necessitating adaptive improvements for reliable efficiency.
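Summary
Large reasoning models (LRMs) have delivered significant performance gains on complex tasks, but their tendency to overthink makes them inefficient. This study examines the internal mechanisms of reinforcement learning (RL)-trained LRMs when they are asked to save thinking, and identifies three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through a combined analysis of confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections, we identify key factors that influence reasoning behavior. We further find that NT shortens outputs at the cost of accuracy, whereas ET and IT reduce response length while maintaining accuracy. Our findings reveal fundamental inconsistencies in RL-optimized LRMs, which call for adaptive improvements to achieve reliable efficiency.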
ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Abstract
arXiv:2505.15068v1. Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.
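Summary
Recent progress in large language models (LLMs) has brought substantial advances in solving mathematical problems. Existing benchmarks, however, often fail to reflect the complexity of real-world problems, which call for open-ended, interdisciplinary reasoning and the integration of computational tools. To fill this gap, we introduce ModelingBench, a new benchmark of real-world-inspired, open-ended problems drawn from math modeling competitions in domains ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also accepts multiple valid solutions, which captures the ambiguity and creativity of practical modeling. We further present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and performs iterative self-refinement to produce well-grounded, creative solutions. To evaluate outputs, we propose ModelingJudge, an expert-in-the-loop system that uses LLMs as domain-specialized judges to assess solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions that are hard to distinguish from those of human experts. Taken together, our work offers a comprehensive framework for evaluating and advancing real-world problem solving in open-ended, interdisciplinary modeling challenges.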
Reinforcement Learning from User Feedback
Abstract
arXiv:2505.14946v1. As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback, which is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
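Summary
As large language models (LLMs) become more widely used in diverse user-facing applications, keeping them aligned with real user preferences becomes essential. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of ordinary users. We propose Reinforcement Learning from User Feedback (RLUF), a framework that aligns LLMs by directly using implicit signals from users in production. RLUF addresses the key challenges of user feedback, which is usually binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the probability that an LLM response receives a "Love Reaction", a lightweight form of positive user feedback, and integrate P[Love] with helpfulness and safety objectives in a multi-objective policy optimization framework. Large-scale experiments show that P[Love] predicts increases in positive feedback and can serve as a reliable offline evaluator of future user behavior. Policy optimization with P[Love] significantly raises the observed positive-feedback rate, including a 28% increase in Love Reactions in live A/B tests. Optimizing for positive reactions does, however, raise reward hacking challenges, so the objectives must be balanced carefully. By directly using implicit signals from users, RLUF offers a practical path to aligning LLMs with real-world user preferences at scale.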
Self-Evolving Curriculum for LLM Reasoning
Abstract
arXiv:2505.14970v1. Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains (planning, inductive reasoning, and mathematics), our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
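Summary
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly improving their reasoning in domains such as mathematics and code generation. A key factor in the success of RL fine-tuning is the training curriculum, that is, the order in which training problems are presented. Random curricula are a common baseline but perform poorly; manually designed curricula usually depend heavily on heuristics, and online filtering methods can be too expensive to compute. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns the curriculum policy alongside the RL fine-tuning process. The method models curriculum selection as a non-stationary multi-armed bandit problem and treats each problem category (such as difficulty level or problem type) as a separate arm. We use the absolute advantage from policy gradient methods as a proxy for immediate learning gain. At each training step, the curriculum policy selects the categories that maximize this reward signal and is updated with the TD(0) method. In experiments across three reasoning domains (planning, inductive reasoning, and mathematics), SEC significantly improves the models' reasoning abilities and lets them generalize better to harder, out-of-distribution test problems. The method also achieves a better balance of skills when fine-tuning on several reasoning domains at once. These findings indicate that SEC is an effective strategy for RL fine-tuning of LLMs.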
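The curriculum policy described in the SEC entry can be illustrated with a small Python sketch: each problem category is a bandit arm, the reward is the mean absolute advantage of the rollouts drawn from that category, and the arm value is updated with a TD(0)-style step, Q <- Q + lr * (r - Q). The class name `CurriculumBandit`, the softmax sampling rule, and the `lr` and `temperature` values are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random
from typing import Dict, List, Sequence


class CurriculumBandit:
    """Non-stationary bandit over problem categories (illustrative sketch only)."""

    def __init__(self, categories: Sequence[str], lr: float = 0.5,
                 temperature: float = 0.2, seed: int = 0) -> None:
        self.q: Dict[str, float] = {c: 0.0 for c in categories}
        self.lr = lr                    # step size; a constant step tracks drifting rewards
        self.temperature = temperature  # assumed softmax temperature
        self.rng = random.Random(seed)

    def probabilities(self) -> Dict[str, float]:
        """Softmax over arm values, shifted by the max for numerical stability."""
        top = max(self.q.values())
        weights = {c: math.exp((v - top) / self.temperature) for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self, batch_size: int) -> List[str]:
        """Draw the categories for the next training batch."""
        probs = self.probabilities()
        return self.rng.choices(list(probs), weights=list(probs.values()), k=batch_size)

    def update(self, category: str, advantages: Sequence[float]) -> None:
        """TD(0)-style update using mean |advantage| as the learning-gain proxy."""
        reward = sum(abs(a) for a in advantages) / len(advantages)
        self.q[category] += self.lr * (reward - self.q[category])


bandit = CurriculumBandit(["easy", "medium", "hard"])
bandit.update("easy", [0.0, 0.0, 0.0, 0.0])       # always solved: no learning signal
bandit.update("medium", [0.5, -0.5, 0.5, -0.5])   # mixed outcomes: large |advantage|
bandit.update("hard", [0.0, 0.0, 0.0, 0.0])       # never solved: no learning signal
print(bandit.probabilities())
```

In this toy run the "medium" arm ends up with the highest sampling probability, because problems that are sometimes solved and sometimes not produce the largest absolute advantages. The constant step size is what lets the estimates follow the model as it improves and the useful difficulty level shifts.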
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
Abstract
arXiv:2505.15400v1. Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of "Internal Self-Recovery Mechanism" where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
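Summary
Large reasoning models (LRMs) achieve excellent performance through long reasoning chains, but redundant reasoning, especially on simple tasks, often leads to excessive computational overhead. This study systematically quantifies the performance upper bounds of LRMs in "Long-Thinking" and "No-Thinking" modes and reveals an "Internal Self-Recovery Mechanism", in which models implicitly supplement their reasoning while generating the answer. Building on this finding, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. Using accuracy-aware length reward regulation, ASRR allocates reasoning effort according to problem difficulty and achieves efficient reasoning at a negligible cost in performance. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces the reasoning budget by up to 32.5% on the 1.5B model and 25.7% on the 7B model, losing only 1.2% and 0.6% pass@1 accuracy respectively, and significantly raises harmless rates on safety benchmarks (by up to +21.7%). These results demonstrate the potential of ASRR for efficient, adaptive, and safer reasoning in LRMs.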
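The ASRR abstract does not give the formula behind its accuracy-aware length reward regulation, so the following Python sketch is only one plausible reading of the idea: penalize long responses only when the sampled group already answers the problem accurately, and scale the penalty with how easy the problem appears. The function name `length_regulated_rewards`, the threshold, the maximum penalty, and the linear scaling are all assumptions, not the authors' definition.

```python
from typing import List, Sequence


def length_regulated_rewards(correct: Sequence[bool], lengths: Sequence[int],
                             accuracy_threshold: float = 0.75,
                             max_penalty: float = 0.5) -> List[float]:
    """Correctness reward minus a length penalty that switches on for easy problems.

    Hypothetical scheme for illustration; not the formula used by ASRR.
    """
    if not 0.0 <= accuracy_threshold < 1.0:
        raise ValueError("accuracy_threshold must be in [0, 1)")
    accuracy = sum(correct) / len(correct)  # group accuracy as a difficulty proxy
    if accuracy < accuracy_threshold:
        strength = 0.0                      # hard problem: leave reasoning length alone
    else:
        # Penalty grows linearly from 0 at the threshold to max_penalty at 100% accuracy.
        strength = max_penalty * (accuracy - accuracy_threshold) / (1.0 - accuracy_threshold)
    shortest, longest = min(lengths), max(lengths)
    span = longest - shortest
    rewards = []
    for is_correct, length in zip(correct, lengths):
        overlong = (length - shortest) / span if span else 0.0  # 0 = shortest, 1 = longest
        rewards.append(float(is_correct) - strength * overlong)
    return rewards


# Easy problem: every sample is correct, so longer answers earn less.
print(length_regulated_rewards([True, True, True, True], [100, 200, 300, 500]))
# Hard problem: accuracy is low, so no length penalty is applied.
print(length_regulated_rewards([True, False, False, False], [100, 200, 300, 500]))
```

Gating the penalty on group accuracy is the point of the sketch: shortening pressure appears only where the model can already solve the problem, which matches the abstract's stated goal of allocating reasoning effort by difficulty.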
lmgame-Bench: How Good are LLMs at Playing Games?
Abstract
arXiv:2505.15146v1. Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games does not yield an effective evaluation, for three reasons: brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
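Summary
Playing video games requires perception, memory, and planning, which are exactly the core abilities that modern large language model (LLM) agents are expected to master. This study analyzes the main challenges of using popular video games to evaluate modern LLMs and finds that placing LLMs directly into games does not produce an effective evaluation, for three reasons: brittle visual perception, prompt sensitivity, and potential data contamination. We therefore introduce lmgame-Bench, which turns games into reliable evaluations. It brings together platformer, puzzle, and narrative games, delivers them through a unified Gym-style API, and pairs them with lightweight perception and memory scaffolds, with the aim of stabilizing prompt variance and removing data contamination. Tests on 13 leading models show that lmgame-Bench is challenging while still separating models well. Correlation analysis shows that each game probes a unique blend of capabilities that other tests often examine in isolation. More interestingly, reinforcement learning on a single lmgame-Bench game transfers to unseen games and to external planning tasks. The evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.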
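The lmgame-Bench entry mentions a unified Gym-style API paired with memory scaffolds. The Python sketch below shows what such an interface might look like, following the Gymnasium convention in which `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. `CorridorGame`, `run_episode`, and the memory format are hypothetical and are not part of lmgame-Bench; the `policy` callable stands in for an LLM agent.

```python
from typing import Callable, Dict, List, Tuple

Policy = Callable[[str, List[str]], str]  # (observation, memory) -> action


class CorridorGame:
    """Toy text game: walk from cell 0 to the last cell of a corridor."""

    def __init__(self, length: int = 4, max_steps: int = 10) -> None:
        self.length = length
        self.max_steps = max_steps
        self.pos = 0
        self.steps = 0

    def _obs(self) -> str:
        return f"You are in cell {self.pos} of {self.length - 1}."

    def reset(self) -> Tuple[str, Dict]:
        self.pos, self.steps = 0, 0
        return self._obs(), {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, Dict]:
        self.steps += 1
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        terminated = self.pos == self.length - 1          # goal reached
        truncated = not terminated and self.steps >= self.max_steps  # step budget spent
        reward = 1.0 if terminated else 0.0
        return self._obs(), reward, terminated, truncated, {}


def run_episode(env: CorridorGame, policy: Policy, memory_size: int = 3) -> float:
    """Play one episode, passing the agent a short rolling memory of past moves."""
    obs, _ = env.reset()
    memory: List[str] = []
    total = 0.0
    while True:
        action = policy(obs, memory)
        memory = (memory + [f"{obs} -> {action}"])[-memory_size:]  # keep recent turns only
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            return total


print(run_episode(CorridorGame(), lambda obs, memory: "right"))
```

Because every game exposes the same `reset`/`step` pair, the episode loop and the memory scaffold can be written once and reused across games, which is the practical benefit of a unified interface.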
ClickSight: Interpreting Student Clickstreams to Reveal Insights on Learning Strategies via LLMs
Abstract
arXiv:2505.15410v1. Clickstream data from digital learning environments offer valuable insights into students' learning behaviors, but are challenging to interpret due to their high dimensionality and granularity. Prior approaches have relied mainly on handcrafted features, expert labeling, clustering, or supervised models, and therefore often lack generalizability and scalability. In this work, we introduce ClickSight, an in-context Large Language Model (LLM)-based pipeline that interprets student clickstreams to reveal their learning strategies. ClickSight takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students' behaviors during interaction. We evaluate four different prompting strategies and investigate the impact of self-refinement on interpretation quality. Our evaluation spans two open-ended learning environments and uses a rubric-based domain-expert evaluation. Results show that while LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies by prompting strategy, and self-refinement offers limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.
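Summary
Clickstream data from digital learning environments offer valuable insight into how students learn, but their high dimensionality and fine granularity make them hard to interpret. Existing approaches rely mainly on handcrafted features, expert labeling, clustering, or supervised models, and generally fall short on generalizability and scalability. This study presents ClickSight, an in-context pipeline based on large language models (LLMs) that interprets student clickstreams to reveal their learning strategies. The system takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students' behavior during interaction. We evaluate four prompting strategies and examine how self-refinement affects interpretation quality. The experiments cover two open-ended learning environments and use a rubric-based evaluation by domain experts. The results show that although LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies with the prompting strategy, and self-refinement brings only limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.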